Cross-situational word learning is based on the notion that a learner can determine the 
referent of a word by finding something in common across many observed uses of that 
word. Here we propose an adaptive learning algorithm that contains a parameter that 
controls the strength of the reinforcement applied to associations between concurrent 
words and referents, and a parameter that regulates inference, which includes built-in 
biases, such as mutual exclusivity, and information about past learning events. By adjusting 
these parameters so that the model predictions agree with data from representative 
experiments on cross-situational word learning, we were able to explain the learning 
strategies adopted by the participants of those experiments in terms of a trade-off 
between reinforcement and inference. These strategies can vary wildly depending on 
the conditions of the experiments. For instance, for fast mapping experiments (i.e., 
the correct referent could, in principle, be inferred in a single observation) inference is 
prevalent, whereas for segregated contextual diversity experiments (i.e., the referents 
are separated in groups and are exhibited with members of their groups only) 
reinforcement is predominant. Other experiments are explained with more balanced 
doses of reinforcement and inference. 
1. INTRODUCTION 

A desirable goal of a psychological theory is to offer explanations,
grounded in elementary principles, of the data available from
psychology experiments (Newell, 1994). Although most of these
quantitative psychological data are related to mental chronometry
and memory accuracy, recent explorations of human performance in
acquiring an artificial lexicon under controlled laboratory
conditions have paved the way to understanding the
learning strategies humans use to infer a word-object mapping
(Yu and Smith, 2007; Kachergis et al., 2009; Smith et al., 2011;
Kachergis et al., 2012; Yu and Smith, 2012a). These experiments 
are based on the cross-situational word-learning paradigm which 
avers that a learner can determine the meaning of a word by find- 
ing something in common across all observed uses of that word 
(Gleitman, 1990; Pinker, 1990). In that sense, learning takes place 
through the statistical sampling of the contexts in which a word 
appears in accord with the classical associationist stance of Hume 
and Locke that the mechanism of word learning is sensitivity to 
covariation: if two events occur at the same time, they become 
associated (Bloom, 2000). 

In a typical cross-situational word-learning experiment, par- 
ticipants are exposed repeatedly to multiple unfamiliar objects 
concomitantly with multiple spoken pseudo-words, such that a 
word and its correct referent (object) always appear together on a 
learning trial. Different trials exhibiting distinct word-object pairs 
will eventually allow the disambiguation of the word-object asso- 
ciations and the learning of the correct mapping (Yu and Smith, 
2007). However, it is questionable whether this scenario is suitable
to describe the actual word-learning process of children, even in
the unambiguous situation where a single novel object is followed
by the utterance of its corresponding pseudo-word. In fact,
it was shown that young children will only make the connection
between the object and the word provided they have a reason
to believe that they are in the presence of an act of naming, and for
this the speaker has to be present (Baldwin et al., 1996; Bloom,
2000; Waxman and Gelman, 2009). Adults could learn those associations
either because they were previously instructed by the
experimenter that they would be learning which words go with
which objects or because they could infer that the disembodied
voice is an act of naming by a concealed person. Although there
have been claims that cross-situational statistical learning is part
of the repertoire of young word learners (Yu and Smith, 2008), the
effect of individual differences in attention and vocabulary development
of the infants complicates this issue considerably, and it
is still a matter for debate (Yu and Smith, 2012b; Smith and Yu,
2013). 

There are several other alternative or complementary
approaches to the statistical learning formulation of language
acquisition considered in this paper. For instance, the
social-pragmatic hypothesis claims that the child makes the connections
between words and their referents by understanding the referential
intentions of others. This approach, which seems to be
originally due to Augustine, implies that children use intuitive
psychology to "read" the adults' minds (Bloom, 2000). A more
recent approach that explores the grounding of language in
perception and action has proved effective in the design of
linguistic capabilities in humanoid cognitive robots (Cangelosi
et al., 2007; Cangelosi, 2010; Pezzulo et al., 2013) as well as in the
support of word learning by toddlers through the stabilization
of their attention on the selected object (Yu and Smith, 2012b).
In contrast with the unsupervised cross-situational learning
scheme, the scenario known as operant conditioning involves
the active participation of the agents in the learning process,
with exchange of non-linguistic cues to provide feedback on
the learner's inferences. This supervised learning scheme has
been applied to the design of a system for communication by
autonomous robots in the Talking Heads experiments (Steels,
2003). We note that a comparison between the cross-situational
and operant conditioning learning schemes indicates that
they perform similarly in the limit of very large lexicon sizes
(Fontanari and Cangelosi, 2011). 

As our goal is to interpret the learning performance of adults
using a few plausible reasoning tenets, here we assume that in
order to learn a word-object mapping within the cross-situational
word-learning scenario the learner should be able to (i) recall
at least a fraction of the word-object pairings that appeared in
the learning trials, (ii) register both co-occurrences and
non-co-occurrences of words and objects and (iii) apply the mutual
exclusivity principle which favors the association of novel words
to novel objects (Markman and Wachtel, 1988). Of course, we
note that a hypothetical learner could achieve cross-situational
learning solely by registering and recalling co-occurrences of
words and objects without carrying out any inferential reasoning
(Blythe et al., 2010; Tilles and Fontanari, 2012a), but we find
it implausible that human learners would not reap the benefits
(e.g., fast mapping) of employing mutual exclusivity (Vogt, 2012;
Reisenauer et al., 2013). 

In this paper we offer an adaptive learning algorithm that
comprises two parameters which regulate the associative reinforcement
of pairings between concurrent words and objects, and
the non-associative inference process that handles built-in biases
(e.g., mutual exclusivity) as well as information about past learning
events. By setting the values of these parameters so as to fit a
representative selection of experimental data presented in Kachergis
et al. (2009, 2012) we are able to identify and explain the learning
strategies adopted by the participants of those experiments in
terms of a trade-off between reinforcement and inference. 

2. CROSS-SITUATIONAL LEARNING SCENARIO 

We assume there are $N$ objects $o_1, \ldots, o_N$, $N$ words $w_1, \ldots, w_N$
and a one-to-one mapping between words and objects represented
by the set $\Gamma = \{(w_1, o_1), \ldots, (w_N, o_N)\}$. At each learning
trial, $C$ word-object pairs are selected from $\Gamma$ and presented
to the learner without providing any clue on which
word goes with which object. For instance, pictures of the
$C$ objects are displayed in a slide while $C$ pseudo-words
are spoken sequentially such that their spatial and temporal
arrangements do not give away the correct word-object associations
(Yu and Smith, 2007; Kachergis et al., 2009). We
refer to the subset of words and their referents (objects) presented
to the learner in a learning trial as the context
$\Omega = \{w_1, o_1, w_2, o_2, \ldots, w_C, o_C\}$. The context size $C$ is then a measure
of the within-trial ambiguity, i.e., the number of co-occurring
word-object pairs per learning trial. The selection procedure
from the set $\Gamma$, which may favor some particular subsets of
word-object pairs, determines the different experimental setups
discussed in this paper. Although each individual trial is highly
ambiguous, repetition of trials with partially overlapping contexts
should in principle allow the learning of the $N$ word-object
associations. 
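
As a minimal sketch of this scenario, the training trials could be generated as follows. The sketch assumes that contexts are drawn uniformly at random without replacement within a trial; the experiments discussed later impose additional constraints on the selection. The function name `sample_contexts` and the use of an integer i to stand for the pair (w_i, o_i) are illustrative choices, not part of the original formulation.

```python
import random

def sample_contexts(n_pairs, context_size, n_trials, seed=None):
    """Draw n_trials contexts, each a sorted list of distinct pair indices.

    Index i stands for the word-object pair (w_i, o_i). Uniform sampling is an
    assumption made for illustration only.
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_pairs), context_size))
            for _ in range(n_trials)]

# Example: N = 18 pairs, C = 4 pairs per trial, 27 trials.
contexts = sample_contexts(n_pairs=18, context_size=4, n_trials=27, seed=0)
```

Each element of `contexts` plays the role of one context; the learner sees the words and objects of a trial but not which word goes with which object.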

After the training stage is completed, which typically comprises
about two dozen trials, the learning accuracy is measured
by instructing the learner to pick the object among the $N$ objects
on display which the learner thinks is associated with a particular
target word. The test is repeated for all $N$ words and the average
learning accuracy is calculated as the fraction of correct guesses
(Kachergis et al., 2009). 

This cross-situational learning scenario does not account for
the presence of noise, such as the effect of out-of-context words.
This situation can be modeled by assuming that there is a certain
probability (noise) that the referent of one of the spoken
words is not part of the context (so that word can be said to be
out of context). Although theoretical analysis shows that there
is a maximum noise intensity beyond which statistical learning
is unattainable (Tilles and Fontanari, 2012b), as yet no experiment
has been carried out to verify the existence of this threshold
phenomenon in the learning performance of human subjects. 

3. MODEL 

We model learning as a change in the confidence with which
the algorithm (or, for simplicity, the learner) associates the word
$w_i$ to an object $o_j$ that results from the observation and analysis
of the contexts presented in the learning trials. More to the
point, this confidence is represented by the probability $P_t(w_i, o_j)$
that $w_i$ is associated to $o_j$ at learning trial $t$. This probability
is normalized such that $\sum_{o_j} P_t(w_i, o_j) = 1$ for all $w_i$ and $t > 0$,
which then implies that when the word $w_i$ is presented to the
learner in the testing stage the learning accuracy is given simply by
$P_t(w_i, o_i)$. In addition, we assume that $P_t(w_i, o_j)$ contains information
presented in the learning trials up to and including trial $t$
only. 

If at learning trial $t$ the learner observes the context
$\Omega_t = \{w_1, o_1, w_2, o_2, \ldots, w_C, o_C\}$ then it can infer the existence of two
other informative sets. First, the set of the words (and their referents)
that appear for the first time at trial $t$, which we denote by
$\tilde\Omega_t = \{\tilde w_1, \tilde o_1, \tilde w_2, \tilde o_2, \ldots, \tilde w_{\tilde C_t}, \tilde o_{\tilde C_t}\}$.
Clearly, $\tilde\Omega_t \subseteq \Omega_t$ and $\tilde C_t \le C$.
Second, the set of words (and their referents) that do not
appear in $\Omega_t$ but that have already appeared in the previous trials,
$\bar\Omega_t = \{\bar w_1, \bar o_1, \ldots, \bar w_{N_t - C}, \bar o_{N_t - C}\}$, where $N_t$ is the total number
of different words that appeared in contexts up to and including
trial $t$. Clearly, $\bar\Omega_t \cap \Omega_t = \emptyset$. The update rule of the confidences
$P_t(w_i, o_j)$ depends on which of these three sets the word $w_i$ and
the object $o_j$ belong to (if $i \ne j$ they may belong to different sets).
In fact, our learning algorithm comprises a parameter $\chi \in [0, 1]$
that measures the associative reinforcement capacity and applies
only to known words that appear in the current context, and
a parameter $\beta \in [0, 1]$ that measures the inference capacity and
applies either to known words that do not appear in the current
context or to new words in the current context. Before the
experiment begins ($t = 0$) we set $P_0(w_i, o_j) = 0$ for all words $w_i$
and objects $o_j$. Next we describe how the confidences are updated
following the sequential presentation of contexts. 

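In the first trial ($t = 1$) all words are new ($\tilde C_1 = N_1 = C$), so
we set

$$P_1(\tilde w_i, \tilde o_j) = \frac{1}{C} \tag{1}$$

for $\tilde w_i, \tilde o_j \in \tilde\Omega_1 = \Omega_1$. In the second or in an arbitrary trial $t$ we
expect to observe contexts exhibiting both novel and repeated
words. Novel words must go through an inference preprocessing
stage before the reinforcement procedure can be applied to
them. This is so because if $\tilde w_i$ appears for the first time at trial $t$
then $P_{t-1}(\tilde w_i, o_j) = 0$ for all objects $o_j$ and since the reinforcement
is proportional to $P_{t-1}(\tilde w_i, o_j)$ the confidences associated
to $\tilde w_i$ would never be updated (see Equation 5 and the explanation
thereafter). Thus, when a novel word $\tilde w_i$ appears at trial $t \ge 2$,
we redefine its confidence values at the previous trial (originally
set to zero) as

$$P_{t-1}(\tilde w_i, \tilde o_j) = \frac{\beta}{\tilde C_t} + \frac{1-\beta}{N_{t-1} + \tilde C_t}, \tag{2}$$

$$P_{t-1}(\tilde w_i, o_j) = \frac{1-\beta}{N_{t-1} + \tilde C_t}, \tag{3}$$

$$P_{t-1}(\tilde w_i, \bar o_j) = \frac{1-\beta}{N_{t-1} + \tilde C_t}, \tag{4}$$

where $\tilde o_j$ denotes a novel object in the current context, $o_j$ a previously
seen object in the current context, and $\bar o_j$ a previously seen object
absent from it.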



On the one hand, setting the inference parameter $\beta$ to its maximum
value $\beta = 1$ enforces the mutual exclusivity principle which
requires that the new word $\tilde w_i$ be associated with equal probability
to the $\tilde C_t$ new objects $\tilde o_j$ in the current context. Hence in the
case $\tilde C_t = 1$ the meaning of the new word would be inferred in a
single presentation. On the other hand, for $\beta = 0$ the new word
is associated with equal probability to all objects already seen up
to and including trial $t$, i.e., $N_t = N_{t-1} + \tilde C_t$. Intermediate values
of $\beta$ describe a situation of imperfect inference. Note that
using Equations 2-4 we can easily verify that
$\sum_{\tilde o_j} P_{t-1}(\tilde w_i, \tilde o_j) + \sum_{o_j} P_{t-1}(\tilde w_i, o_j) + \sum_{\bar o_j} P_{t-1}(\tilde w_i, \bar o_j) = 1$,
in accord with the
normalization constraint. 
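
The preprocessing step of Equations 2-4 can be sketched in a few lines of Python. This is only an illustration under our reading of those equations; the function name `novel_word_confidences` and its argument names are hypothetical and do not come from the original formulation.

```python
def novel_word_confidences(beta, n_prev, n_new, n_old_in_context):
    """Confidences assigned to a novel word by Equations 2-4.

    beta: inference parameter in [0, 1]
    n_prev: number of distinct objects seen before the current trial (N_{t-1})
    n_new: number of novel objects in the current context (at least 1)
    n_old_in_context: previously seen objects that are in the current context
    Returns the three confidence values and their total over all objects.
    """
    base = (1.0 - beta) / (n_prev + n_new)
    p_new = beta / n_new + base      # Equation 2: novel objects in the context
    p_old_in_context = base          # Equation 3: familiar objects in the context
    p_absent = base                  # Equation 4: familiar objects not in the context
    n_absent = n_prev - n_old_in_context
    total = (n_new * p_new
             + n_old_in_context * p_old_in_context
             + n_absent * p_absent)
    return p_new, p_old_in_context, p_absent, total

# Example with 10 previously seen objects and a context holding 2 novel
# and 2 familiar objects; `total` equals 1 up to rounding.
p_new, p_old, p_absent, total = novel_word_confidences(0.7, 10, 2, 2)
```

With `beta = 1` and a single novel object the novel word is mapped to it with confidence one, which is the fast-mapping case mentioned above.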

Now we can focus on the update rule of the confidence
$P_t(w_i, o_j)$ in the case both word $w_i$ and object $o_j$ appear in
the context at trial $t$. The rule applies both to repeated and
novel words, provided the confidences of the novel words are
preprocessed according to Equations 2-4. In order to fulfill automatically
the normalization condition for word $w_i$, the increase of
the confidence $P_t(w_i, o_j)$ with $o_j \in \Omega_t$ must be compensated by
the decrease of the confidences $P_t(w_i, \bar o_j)$ with $\bar o_j \in \bar\Omega_t$. This can
be implemented by distributing the total flux of probability
out of the latter confidences, i.e., $\chi \sum_{\bar o_j \in \bar\Omega_t} P_{t-1}(w_i, \bar o_j)$, over
the confidences $P_t(w_i, o_j)$ with $o_j \in \Omega_t$. Hence the net gain of
confidence on the association between $w_i$ and $o_j$ is given by

$$r_{t-1}(w_i, o_j) = \chi P_{t-1}(w_i, o_j) \, \frac{\sum_{\bar o_k \in \bar\Omega_t} P_{t-1}(w_i, \bar o_k)}{\sum_{o_k \in \Omega_t} P_{t-1}(w_i, o_k)} \tag{5}$$



where, as mentioned before, the parameter $\chi \in [0, 1]$ measures
the strength of the reinforcement process. Note that if both $o_j$
and $o_k$ appear in the context together with $w_i$ then the reinforcement
procedure should not create any distinction between the
associations $(w_i, o_j)$ and $(w_i, o_k)$. This result is achieved provided
that the ratio of the confidence gains equals the ratio of the confidences
before reinforcement, i.e.,
$r_{t-1}(w_i, o_j) / r_{t-1}(w_i, o_k) = P_{t-1}(w_i, o_j) / P_{t-1}(w_i, o_k)$.
This is the reason that the reinforcement
gain of a word-object association given by Equation 5
is proportional to the previous confidence on that association.
The total increase in the confidences between $w_i$ and the objects
that appear in the context, i.e., $\sum_{o_j \in \Omega_t} r_{t-1}(w_i, o_j)$, equals the
product of $\chi$ and the total confidence on the associations between
$w_i$ and the objects that do not appear in the context, i.e.,
$\sum_{\bar o_j \in \bar\Omega_t} P_{t-1}(w_i, \bar o_j)$. So for $\chi = 1$ the confidences associated to
objects absent from the context are fully transferred to the confidences
associated to objects present in the context. Lower values
of $\chi$ allow us to control the flow of confidence from objects in
$\bar\Omega_t$ to objects in $\Omega_t$. 
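
As a quick numerical illustration of Equation 5 (the numbers are invented for this example and are not taken from the experiments), suppose that before the update a word $w_i$ has confidences 0.4 and 0.2 on two objects $o_1$ and $o_2$ in the context, confidences 0.3 and 0.1 on two absent objects, and that $\chi = 0.5$. Then

$$r_{t-1}(w_i, o_1) = 0.5 \times 0.4 \times \frac{0.3 + 0.1}{0.4 + 0.2} = \frac{2}{15} \approx 0.133, \qquad r_{t-1}(w_i, o_2) = 0.5 \times 0.2 \times \frac{0.3 + 0.1}{0.4 + 0.2} = \frac{1}{15} \approx 0.067.$$

The two gains add up to $0.2 = \chi \times 0.4$, i.e., the fraction $\chi$ of the confidence held by the absent objects, and their ratio equals the ratio $0.4/0.2 = 2$ of the confidences before reinforcement.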

Most importantly, in order to implement the reinforcement
process the learner should be able to gauge the relevance of the
information about the previous trials, which is condensed in the
confidence values $P_t(w_i, o_j)$. The gauging of this information is
quantified by the word- and trial-dependent quantity $\alpha_t(w_i) \in [0, 1]$
that allows for the interpolation between the cases of maximum
relevance ($\alpha_t(w_i) = 1$) and complete irrelevance ($\alpha_t(w_i) = 0$)
of the information stored in the confidences $P_t(w_i, o_j)$. In particular,
we assume that the greater the certainty on the association
between word $w_i$ and its referent, the more relevant that information
is to the learner. A quantitative measure of the uncertainty
associated to the confidences regarding word $w_i$ is given by the
entropy

$$H_t(w_i) = - \sum_{o_j \in \Omega_t \cup \bar\Omega_t} P_t(w_i, o_j) \log \left[ P_t(w_i, o_j) \right] \tag{6}$$

whose maximum ($\log N_t$) is obtained by the uniform distribution
$P_t(w_i, o_j) = 1/N_t$ for all $o_j \in \Omega_t \cup \bar\Omega_t$, and whose minimum (0)
by $P_t(w_i, o_j) = 1$ and $P_t(w_i, o_k) = 0$ for $o_k \ne o_j$. So we define

$$\alpha_t(w_i) = \alpha_0 + (1 - \alpha_0) \left[ 1 - \frac{H_t(w_i)}{\log N_t} \right] \tag{7}$$

where $\alpha_0 \in [0, 1]$ is a baseline information gauge factor corresponding
to the maximum uncertainty about the referent of a
target word. 
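
The information gauge factor can be sketched as follows, under the reading that it equals the baseline at maximum entropy and rises to one at zero entropy, as the surrounding text requires. The function names are illustrative, and the treatment of the degenerate case of a single seen object (where the logarithm of the number of objects vanishes) is an assumption of this sketch, not something stated in the text.

```python
import math

def entropy(confidences):
    """Equation 6; zero-confidence terms contribute nothing (0 log 0 = 0)."""
    return -sum(p * math.log(p) for p in confidences if p > 0.0)

def gauge_factor(confidences, alpha0):
    """Equation 7, taking N_t to be the number of confidences supplied."""
    n_seen = len(confidences)
    if n_seen < 2:
        # Assumption: with a single object log N_t = 0 and Equation 7 is
        # undefined, so we treat this case as full certainty.
        return 1.0
    return alpha0 + (1.0 - alpha0) * (1.0 - entropy(confidences) / math.log(n_seen))

# A uniform distribution gives the baseline alpha0; a fully certain one gives 1.
uniform_case = gauge_factor([0.25, 0.25, 0.25, 0.25], alpha0=0.3)
certain_case = gauge_factor([1.0, 0.0, 0.0], alpha0=0.3)
```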

Finally, recalling that at trial $t$ the learner has access to the sets
$\Omega_t$, $\bar\Omega_t$ as well as to the confidences at trial $t - 1$ we write the
update rule

$$P_t(w_i, o_j) = P_{t-1}(w_i, o_j) + \alpha_{t-1}(w_i) \, r_{t-1}(w_i, o_j) + \left[ 1 - \alpha_{t-1}(w_i) \right] \left[ \frac{1}{N_t} - P_{t-1}(w_i, o_j) \right] \tag{8}$$

for $w_i, o_j \in \Omega_t$. Note that if $\alpha_{t-1}(w_i) = 0$ the learner would
associate word $w_i$ to all objects that have appeared up to and
including trial $t$ with equal probability. This situation happens
only if $\alpha_0 = 0$ and if there is complete uncertainty about the referent
of word $w_i$. Hence the quantity $\alpha_t(w_i)$ determines the extent
to which the previous confidences on associations involving word
$w_i$ influence the update of those confidences. 

Now we consider the update rule for the confidence $P_t(w_i, \bar o_j)$
in the case that word $w_i$ appears in the context at trial $t$ but object
$\bar o_j$ does not. (We recall that object $\bar o_j$ must have appeared in some
previous trial.) According to the reasoning that led to Equation 5
this confidence must decrease by the amount $\chi P_{t-1}(w_i, \bar o_j)$ and
so, taking into account the information gauge factor, we obtain

$$P_t(w_i, \bar o_j) = P_{t-1}(w_i, \bar o_j) - \alpha_{t-1}(w_i) \, \chi P_{t-1}(w_i, \bar o_j) + \left[ 1 - \alpha_{t-1}(w_i) \right] \left[ \frac{1}{N_t} - P_{t-1}(w_i, \bar o_j) \right] \tag{9}$$

which can be easily seen to satisfy the normalization

$$\sum_{o_j \in \Omega_t} P_t(w_i, o_j) + \sum_{\bar o_j \in \bar\Omega_t} P_t(w_i, \bar o_j) = 1. \tag{10}$$
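
The update of a word that is present in the context (Equations 8 and 9) can be sketched as below. The dictionary representation, the function name `update_present_word` and the example numbers are assumptions made for illustration; the sketch also assumes that the word has non-zero total confidence on the context objects, which holds once novel words have been preprocessed.

```python
def update_present_word(conf, context_objects, chi, alpha):
    """Apply Equations 8 and 9 to one word that appears in the context.

    conf: dict mapping every object seen so far to the previous confidence
    context_objects: set of objects in the current context
    chi: reinforcement parameter; alpha: information gauge factor of the word
    """
    n_seen = len(conf)
    in_mass = sum(p for o, p in conf.items() if o in context_objects)
    out_mass = sum(p for o, p in conf.items() if o not in context_objects)
    updated = {}
    for o, p in conf.items():
        if o in context_objects:
            change = chi * p * out_mass / in_mass   # gain of Equation 5
        else:
            change = -chi * p                       # loss entering Equation 9
        updated[o] = p + alpha * change + (1.0 - alpha) * (1.0 / n_seen - p)
    return updated

# Invented example: o1 and o2 are in the context, o3 and o4 are not.
before = {"o1": 0.4, "o2": 0.2, "o3": 0.3, "o4": 0.1}
after = update_present_word(before, {"o1", "o2"}, chi=0.5, alpha=0.8)
```

Summing the values of `after` gives one up to rounding, which is the normalization of Equation 10.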



We focus now on the update rule for the confidence $P_t(\bar w_i, \bar o_j)$
with $\bar w_i, \bar o_j \in \bar\Omega_t$, i.e., both the word $\bar w_i$ and the object $\bar o_j$ are
absent from the context shown at trial $t$, but they have already
appeared, not necessarily together, in previous trials. Inference
reasoning similar to that which led to the expressions for the preprocessing
of new words would allow the learner to conclude that a word
absent from the context should be associated to an object that is
also absent from it. In that sense, confidence should flow from the
associations between $\bar w_i$ and objects $o_j \in \Omega_t$ to the associations
between $\bar w_i$ and objects $\bar o_j \in \bar\Omega_t$. Hence, ignoring the information
gauge factor for the moment, the net gain to confidence $P_t(\bar w_i, \bar o_j)$
is given by

$$\bar r_{t-1}(\bar w_i, \bar o_j) = \beta P_{t-1}(\bar w_i, \bar o_j) \, \frac{\sum_{o_k \in \Omega_t} P_{t-1}(\bar w_i, o_k)}{\sum_{\bar o_k \in \bar\Omega_t} P_{t-1}(\bar w_i, \bar o_k)}. \tag{11}$$

The direct proportionality of this gain to $P_{t-1}(\bar w_i, \bar o_j)$ can be justified
by an argument similar to that used to justify Equation 5 in
the case of reinforcement. The information relevance issue is also
handled in a similar manner so the desired update rule reads

$$P_t(\bar w_i, \bar o_j) = P_{t-1}(\bar w_i, \bar o_j) + \alpha_{t-1}(\bar w_i) \, \bar r_{t-1}(\bar w_i, \bar o_j) + \left[ 1 - \alpha_{t-1}(\bar w_i) \right] \left[ \frac{1}{N_t} - P_{t-1}(\bar w_i, \bar o_j) \right] \tag{12}$$



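for $\bar w_i, \bar o_j \in \bar\Omega_t$. To ensure normalization the confidence
$P_t(\bar w_i, o_j)$ must decrease by an amount proportional to
$\beta P_{t-1}(\bar w_i, o_j)$ so that

$$P_t(\bar w_i, o_j) = P_{t-1}(\bar w_i, o_j) - \alpha_{t-1}(\bar w_i) \, \beta P_{t-1}(\bar w_i, o_j) + \left[ 1 - \alpha_{t-1}(\bar w_i) \right] \left[ \frac{1}{N_t} - P_{t-1}(\bar w_i, o_j) \right] \tag{13}$$

for $\bar w_i \in \bar\Omega_t$ and $o_j \in \Omega_t$. We can verify that prescriptions (12)
and (13) satisfy the normalization

$$\sum_{o_j \in \Omega_t} P_t(\bar w_i, o_j) + \sum_{\bar o_j \in \bar\Omega_t} P_t(\bar w_i, \bar o_j) = 1 \tag{14}$$

as expected. 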
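
The verification of Equation (14) can be spelled out in one line. Here $\alpha$ stands for the information gauge factor of the absent word and $S$ for the total confidence that this word places on the objects in the current context before the update; both are shorthands introduced only for this check. Since the confidences of the word at the previous trial sum to one over the $N_t$ objects seen so far, adding Equation (12) over the absent objects and Equation (13) over the context objects gives

$$\underbrace{1}_{\text{previous confidences}} + \underbrace{\alpha \beta S}_{\text{gains, Eq. (11)}} - \underbrace{\alpha \beta S}_{\text{losses}} + (1 - \alpha) \left[ N_t \cdot \frac{1}{N_t} - 1 \right] = 1.$$

The gains sum to $\beta S$ because the denominator of Equation (11) cancels the sum of the confidences on the absent objects.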

In summary, before any trial ($t = 0$) we set all confidence
values to zero, i.e., $P_0(w_i, o_j) = 0$, and fix the values of the parameters
$\alpha_0$, $\chi$ and $\beta$. In the first trial ($t = 1$) we set the confidences
of the words and objects in $\Omega_1$ according to Equation (1), so we
have the values of $P_1(w_i, o_j)$ for $w_i, o_j \in \Omega_1$. In the second trial,
we separate the novel words $\tilde w_i \in \tilde\Omega_2$ and reset $P_1(\tilde w_i, o_j)$ with
$o_j \in \Omega_2 \cup \bar\Omega_2$ according to Equations 2-4. Only then do we calculate
$\alpha_1(w_i)$ with $w_i \in \Omega_1 \cup \Omega_2$ using Equation (7). The confidences
at trial $t = 2$ then follow from Equations (8), (9), (12), and (13).
As before, in the third trial we separate the novel words $\tilde w_i \in \tilde\Omega_3$,
reset $P_2(\tilde w_i, o_j)$ with $o_j \in \Omega_3 \cup \bar\Omega_3$ according to Equations 2-4,
calculate $\alpha_2(w_i)$ with $w_i \in \Omega_1 \cup \Omega_2 \cup \Omega_3$ using Equation (7),
and only then resume the evaluation of the confidences at trial
$t = 3$. This procedure is repeated until the training stage is completed,
say, at $t = t^*$. At this point, knowledge of the confidence
values $P_{t^*}(w_i, o_j)$ allows us to answer any question posed in the
testing stage. 
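
The whole procedure can be condensed into a short Python sketch. It is an illustration under stated assumptions rather than the implementation used to produce the results: index i stands for both a word and its referent, confidences are stored as nested dictionaries, the gauge factor is computed after the preprocessing of novel words using the number of objects seen up to the current trial, and the helper names `gauge_factor`, `shift` and `train` are hypothetical.

```python
import math

def gauge_factor(conf, alpha0):
    """Equation 7 for one word; conf maps each seen object to a confidence."""
    n_seen = len(conf)
    if n_seen < 2:
        return 1.0  # assumption: Equation 7 is undefined when log N_t = 0
    h = -sum(p * math.log(p) for p in conf.values() if p > 0.0)
    return alpha0 + (1.0 - alpha0) * (1.0 - h / math.log(n_seen))

def shift(conf, favoured, rate, alpha):
    """Move confidence towards the objects in `favoured`.

    favoured = context objects, rate = chi: Equations 8 and 9.
    favoured = absent objects, rate = beta: Equations 12 and 13.
    """
    n_seen = len(conf)
    gain_mass = sum(p for o, p in conf.items() if o in favoured)
    loss_mass = sum(p for o, p in conf.items() if o not in favoured)
    if gain_mass == 0.0:
        rate = 0.0  # assumption: nothing to scale up, so nothing is moved
    updated = {}
    for o, p in conf.items():
        if o in favoured:
            change = rate * p * loss_mass / gain_mass if gain_mass > 0.0 else 0.0
        else:
            change = -rate * p
        updated[o] = p + alpha * change + (1.0 - alpha) * (1.0 / n_seen - p)
    return updated

def train(contexts, chi, beta, alpha0):
    """Return confidences[w][o] after presenting the contexts in order."""
    confidences = {}
    seen = set()
    for context in contexts:
        context = set(context)
        novel = context - seen
        n_prev = len(seen)
        seen |= novel
        if n_prev == 0:
            # Equation 1: first trial, uniform over the context.
            for w in context:
                confidences[w] = {o: 1.0 / len(context) for o in context}
            continue
        # Known words have zero confidence on objects never seen before.
        for w in seen - novel:
            for o in novel:
                confidences[w][o] = 0.0
        # Equations 2-4: inference preprocessing of novel words.
        base = (1.0 - beta) / (n_prev + len(novel)) if novel else 0.0
        for w in novel:
            confidences[w] = {
                o: base + (beta / len(novel) if o in novel else 0.0) for o in seen
            }
        absent = seen - context
        updated = {}
        for w in seen:
            alpha = gauge_factor(confidences[w], alpha0)
            if w in context:
                updated[w] = shift(confidences[w], context, chi, alpha)
            else:
                updated[w] = shift(confidences[w], absent, beta, alpha)
        confidences = updated
    return confidences

# Usage: the confidence of word i on its own referent i after training.
final = train([[0, 1], [1, 2], [0, 2]], chi=0.5, beta=0.5, alpha0=0.8)
accuracy_word_0 = final[0][0]
```

The single helper `shift` is used for both kinds of update to make explicit that reinforcement and inference differ only in which objects are favoured and in the rate parameter that controls the transfer.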

Our model borrows many features from other proposed models
of word learning (Siskind, 1996; Fontanari et al., 2009; Frank
et al., 2009; Fazly et al., 2010; Kachergis et al., 2012). In particular,
the entropy expression (6) was used by Kachergis et al. (2012)
to allocate attention trial-by-trial to the associations presented in
the contexts. Here we use that expression to quantify the uncertainty
associated to the various confidences in order to determine
the extent to which those confidences are updated on a learning
trial. A distinctive feature of our model is the update of associations
that are not in the current trial according to Equation
(12). In particular, we note that whereas ad hoc normalization can
only decrease the confidences on associations between words and
objects that did not appear in the current context, our update rule
can increase those associations as well. The extent of this update
is weighted by the inference parameter $\beta$ and it allows the application
of mutual exclusivity to associations that are not shown in
the current context. In fact, the splitting of mental processes into
two classes, namely, reinforcement processes that update associations
in the current context and inference processes that update
the other associations, is the main thrust of our paper. In the next
section we evaluate the adequacy of our model to describe a selection
of cross-situational word-learning experiments carried out
on adult subjects by Kachergis et al. (2009, 2012). 

4. RESULTS 

The cross-situational word-learning experiments of Kachergis
et al. (2009, 2012) aimed to understand how word sampling
frequency (i.e., number of trials in which a word appears), contextual
diversity (i.e., the co-occurrence of distinct words or groups
of words in the learning trials), within-trial ambiguity (i.e., the
context size $C$), and fast-mapping of novel words affect the learning
performance of adult subjects. In this section we compare
the performance of the algorithm described in the previous section
with the performance of adult subjects reported in Kachergis
et al. (2009, 2012). In particular, once the conditions of the training
stage are specified, we carry out $10^4$ runs of our algorithm
for fixed values of the three parameters $\alpha_0$, $\beta$, $\chi$, and then calculate
the average accuracy at trial $t = t^*$ over all those runs
for that parameter setting. Since the algorithm is deterministic,
what changes in each run is the composition of the contexts at
each learning trial. As our goal is to model the results of the
experiments, we search the space of parameters to find the setting
such that the performance of the algorithm matches that of
humans within the error bars (i.e., one standard deviation) of the
experiments. 
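
The parameter search can be pictured with the following sketch, which scans a regular grid of reinforcement and inference parameters for a fixed baseline gauge factor and keeps the points whose predicted accuracies fall within one standard deviation of the experimental means. The grid resolution, the function name `fitting_region` and the callable `simulate` (assumed to return the run-averaged accuracy of each word group) are hypothetical; the text does not specify how the search was organized.

```python
def fitting_region(simulate, targets, grid_size=21):
    """Return the (chi, beta) grid points that fit the data within error bars.

    simulate(chi, beta): hypothetical callable returning a list of average
        accuracies, one per word group, for a fixed baseline gauge factor
    targets: list of (experimental_mean, standard_deviation), one per group
    """
    region = []
    for i in range(grid_size):
        chi = i / (grid_size - 1)
        for j in range(grid_size):
            beta = j / (grid_size - 1)
            predictions = simulate(chi, beta)
            if all(abs(p - mean) <= sd
                   for p, (mean, sd) in zip(predictions, targets)):
                region.append((chi, beta))
    return region

# Toy stand-in for the model and invented targets, for illustration only.
toy_model = lambda chi, beta: [chi, beta]
region = fitting_region(toy_model, [(0.5, 0.1), (0.2, 0.05)], grid_size=11)
```

In an actual fit `simulate` would average the accuracy over many randomly generated training sequences, as described above.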

4.1. WORD SAMPLING FREQUENCY 

In these experiments the number of words (and objects) is N = 
18 and the training stage totals t* = 27 learning trials, with each 
trial comprising the presentation of 4 words together with their 
referents (C = 4). Following Kachergis et al. (2009), we inves- 
tigate two conditions which differ with respect to the number 
of times a word is exhibited in the training stage. In the two- 
frequency condition, the 18 words are divided into two subsets 
of 9 words each. The words in the first subset appear 9 times and 
those in the second only 3 times. In the three-frequency condi- 
tion, the 18 words are divided into three subsets of 6 words each. 
Words in the first subset appear 3 times, in the second, 6 times 
and in the third, 9 times. In these two conditions, the same word 
was not allowed to appear in two consecutive learning trials. 

Figures 1, 2 summarize our main results for the two-frequency
and three-frequency conditions, respectively. The left panels show
the regions (shaded areas) in the $(\chi, \beta)$ plane for fixed $\alpha_0$ where
the algorithm describes the experimental data. We note that if
those regions are located to the left of the diagonal $\chi = \beta$ then the
inference process is dominant whereas if they are to the right of the
diagonal then reinforcement is the dominant process. The middle panels
show the accuracy of the best fit as a function of the parameter $\alpha_0$
and the right panels exhibit the values of $\chi$ and $\beta$ corresponding
to that fit. The broken horizontal lines and the shaded zones
around them represent the means and standard deviations of
the results of experiments carried out with 33 adult subjects
(Kachergis et al., 2009). 

It is interesting that although the words sampled more frequently
are learned best in the two-frequency condition as
expected, this advantage practically disappears in the
three-frequency condition, in which case all words are learned at equal
levels within the experimental error. Note that the average accuracy
for the words sampled 3 times is actually greater than
the accuracy for the words sampled 6 times, but this inversion
is not statistically significant, although, most surprisingly,
the algorithm does reproduce it for $\alpha_0 \in [0.7, 0.8]$. According
to Kachergis et al. (2009), the observed insensitivity to sampling
frequency might arise because the high-frequency
words are learned quickly and, once they are learned, subsequent
trials containing those words will exhibit an effectively
smaller within-trial ambiguity. In this vein, the inversion could
be explained if by chance the words less frequently sampled were
generally paired with the highly sampled words. Thus, contextual
diversity seems to play a key role in cross-situational word
learning. 

4.2. CONTEXTUAL DIVERSITY AND WITHIN-TRIAL AMBIGUITY 

In the first experiment aiming to probe the role of contextual
diversity in cross-situational learning, the 18 words were
divided into two groups of 6 and 12 words, and the contexts of
size $C = 3$ were formed with words belonging to the same group
only. Since the sampling frequency was fixed to 6 repetitions for
each word, those words belonging to the more numerous group
are exposed to a larger contextual diversity (i.e., the variety of
different words with which a given word appears in the course of
the training stage). The results summarized in Figure 3 indicate
clearly that contextual diversity enhances the learning accuracy.
Perhaps more telling is the finding that incorrect responses are
largely due to misassignments to referents whose words belong
to the same group as the test word. In particular, Kachergis et al.
(2009) found that this type of error accounts for 56% of incorrect
answers when the test word belongs to the 6-components
subgroup and for 76% when it belongs to the 12-components
subgroup. The corresponding statistics for our algorithm with the
optimal parameters set at $\alpha_0 = 0.9$ are 43% and 70%, respectively.
The region in the space of parameters where the model can be
said to describe the experimental data is greatly reduced in this
experiment and even the best fit is barely within the error bars.
It is interesting that, in contrast with the previous experiments,
in this case the reinforcement procedure seems to play the more
important role in the performance of the algorithm.

FIGURE 1 | Summary of the results for the two-frequency condition
experiment. Left panel: Regions in the plane $(\chi, \beta)$ where the algorithm fits
the experimental data for fixed $\alpha_0$ as indicated in the figure. Middle panel:
Average accuracy for the best fit to the results of Experiment 1 of Kachergis
et al. (2009) represented by the broken horizontal lines (means) and shaded
regions around them (one standard deviation). The blue symbols represent
the accuracy for the group of words sampled 9 times whereas the red
symbols represent the accuracy for the words sampled 3 times. Right panel:
Parameters $\chi$ and $\beta$ corresponding to the best fit shown in the middle panel.
The other parameters are $N = 18$ and $C = 4$.

FIGURE 2 | Summary of the results for the three-frequency condition
experiment. Left panel: Regions in the plane $(\chi, \beta)$ where the algorithm fits
the experimental data for fixed $\alpha_0$ as indicated in the figure. Middle panel:
Average accuracy for the best fit to the results of Experiment 1 of Kachergis
et al. (2009) represented by the broken horizontal lines (means) and shaded
regions around them (one standard deviation). The blue symbols represent
the accuracy for the group of words sampled 9 times, the green symbols for
the words sampled 6 times, and the red symbols for the words sampled 3
times. Right panel: Parameters $\chi$ and $\beta$ corresponding to the best fit shown
in the middle panel. The other parameters are $N = 18$ and $C = 4$.

FIGURE 3 | Summary of the results of the two-level contextual
diversity experiment. Left panel: Regions in the plane $(\chi, \beta)$ where the
algorithm fits the experimental data for fixed $\alpha_0$ as indicated in the
figure. Middle panel: Average accuracy for the best fit to the results of
Experiment 2 of Kachergis et al. (2009) represented by the broken
horizontal lines (means) and shaded regions around them (one standard
deviation). The blue symbols represent the accuracy for the group of
words belonging to the 12-components subgroup and the red symbols
for the words belonging to the 6-components subgroup. All words are
repeated exactly 6 times during the $t^* = 27$ learning trials. Right panel:
Parameters $\chi$ and $\beta$ corresponding to the best fit shown in the middle
panel. The other parameters are $N = 18$ and $C = 3$.



The effect of the context size or within-trial ambiguity is 
addressed by the experiment summarized in Figure 4, which is 
similar to the previous experiment, except that the words that 
compose the context are chosen uniformly from the entire reper- 
toire of N = 18 words. Two context sizes are considered, namely, 
C = 3 and C = 4. In both cases, there is a large selection of 
parameter values that explain the experimental data, yielding 
results indistinguishable from the experimental average accura- 
cies. This is the reason we do not exhibit a graph akin to those 
shown in the right panels of the previous figures. Since a per- 
fect fitting can be obtained both for x. > P and for x < P, this 
experiment is uninformative with respect to these two abilities. 
As expected, increase of the within-trial ambiguity difficilitate 
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FIGURE 4 | Summary of the results of the experiments where all 
words co-occur without constraint and the N = 18 words are 
repeated exactly 6 times during the t* = 27 learning trials. Left 
panel: Regions in the plane $(\chi, \beta)$ where the algorithm fits the
experimental data for fixed $\alpha_0$ and context size C = 3. Middle panel:
Same as the left panel but for context size C = 4. Right panel: Average
accuracy of the best fit to the results of Experiment 2 of Kachergis
et al. (2009), which are represented by the broken horizontal lines (means) and
shaded regions around them (one standard deviation). The red symbols
are for C = 3 and the blue symbols for C = 4.



learning. In addition, the (experimental) results for C = 3 yield
a learning accuracy that is intermediate between those measured
for the 6- and 12-component subgroups, which agrees
with the conclusion that an increase in contextual
diversity enhances learning, since the mean number of different
co-occurring words is 4.0 in the 6-components subgroup, 9.2 in 
the 12-components subgroup and 8.8 in the uniformly mixed 
situation (Kachergis et al., 2009). 

4.3. FAST MAPPING 

The experiments carried out by Kachergis et al. (2012) were
designed to elicit participants' use of the mutual exclusivity
principle (i.e., the assumption of one-to-one mappings between
words and referents) and to test the flexibility of a learned
word-object association when new evidence is provided in support of
a many-to-many mapping. To see how mutual exclusivity implies
fast mapping, assume that a learner who knows the association
$(w_1, o_1)$ is exposed to the context $\Omega = \{w_1, o_1, w_2, o_2\}$ in which
the word $w_2$ (and its referent) appears for the first time. Then it
is clear that a mutual-exclusivity-biased learner would infer the
association $(w_2, o_2)$ in this single trial. However, a purely
associative learner would give equal weights to $o_1$ and $o_2$ if asked about
the referent of $w_2$.
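
The contrast between the two idealized learners in this thought experiment can be sketched in a few lines of Python. This is only a toy illustration, not the adaptive algorithm of Section 3: the function names `associative_guess` and `exclusive_guess`, the use of raw co-occurrence counts, and the hard exclusion of already-claimed objects are all assumptions made for the example.

```python
from collections import defaultdict

def associative_guess(counts, word, objects):
    """Turn raw co-occurrence counts into a distribution over objects."""
    total = sum(counts[(word, o)] for o in objects)
    return {o: counts[(word, o)] / total for o in objects}

def exclusive_guess(counts, word, objects, known):
    """Same as above, but objects already claimed by another word are excluded
    (a hard, hypothetical version of the mutual exclusivity bias)."""
    taken = {o for w, o in known.items() if w != word}
    free = [o for o in objects if o not in taken]
    total = sum(counts[(word, o)] for o in free)
    return {o: (counts[(word, o)] / total if o in free else 0.0) for o in objects}

# Single exposure to the context {w1, o1, w2, o2}: every word co-occurs
# once with every object displayed in the trial.
counts = defaultdict(int)
context_words, context_objects = ["w1", "w2"], ["o1", "o2"]
for w in context_words:
    for o in context_objects:
        counts[(w, o)] += 1

known = {"w1": "o1"}  # association learned before this trial
assoc = associative_guess(counts, "w2", context_objects)
excl = exclusive_guess(counts, "w2", context_objects, known)
print(assoc)  # equal weights for o1 and o2
print(excl)   # all weight on o2
```

The associative learner cannot separate $o_1$ from $o_2$ after one trial, whereas the exclusivity-biased learner settles on $o_2$ immediately, which is the fast mapping effect described above.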

In the specific experiment we address in this section, N =
12 words and their referents are split up into two groups
of 6 words each, say $A = \{(w_1, o_1), \ldots, (w_6, o_6)\}$ and
$B = \{(w_7, o_7), \ldots, (w_{12}, o_{12})\}$. The context size is set to C = 2 and
the training stage is divided into two phases. In the early phase, only
the words belonging to group A are presented and the duration of
this phase is set such that each word is repeated 3, 6 or 9 times.
In the late phase, the contexts consist of one word belonging to
A and one belonging to B forming fixed couples, i.e., whenever
$w_i$ appears in a context, $w_{i+6}$, with $i = 1, \ldots, 6$, must appear
too. The duration of the late phase depends on the number of
repetitions of each word, which can be 3, 6, or 9 as in the early
phase (Kachergis et al., 2012). The combinations of the sampling
frequencies yield 9 different training conditions, but here we
consider only the case in which the late phase comprises 6 repetitions
of each word.
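
The training design just described can be sketched as a schedule generator. This is a minimal sketch under stated assumptions: the function name `fast_mapping_schedule` is hypothetical, words are represented by their indices (each word implicitly brings its referent into the context), and the early-phase pairing of group-A words by block-wise shuffling is an assumption, since the text does not specify how those pairs were drawn.

```python
import random

def fast_mapping_schedule(early_reps, late_reps=6, seed=0):
    """Return (early, late) lists of contexts, each context a pair of word indices."""
    rng = random.Random(seed)
    group_a = list(range(1, 7))  # words w_1 .. w_6

    # Early phase: only group-A words, C = 2. Each block uses every word
    # once, so early_reps blocks give early_reps repetitions per word.
    early = []
    for _ in range(early_reps):
        block = group_a[:]
        rng.shuffle(block)
        early += [(block[k], block[k + 1]) for k in range(0, 6, 2)]

    # Late phase: fixed couples (w_i, w_{i+6}), each presented late_reps times.
    late = [(i, i + 6) for i in group_a for _ in range(late_reps)]
    rng.shuffle(late)
    return early, late

early, late = fast_mapping_schedule(early_reps=3)
print(len(early), len(late))
```

With 3 early repetitions the early phase has 9 trials, and with 6 late repetitions the late phase has 36 trials, one for each presentation of a fixed couple.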

The testing stage comprises the play of a single word, say
$w_1$, and the display of 11 of the 12 trained objects (Kachergis
et al., 2012). Each word was tested twice with a time lag between
the tests: once without its corresponding early object ($o_1$ in this
case) and once without its late object ($o_7$ in this case). This
procedure requires that we renormalize the confidences for each
test. For instance, when $o_1$ is left out of the display, the
renormalization is



$$P'_{t^*}(w_1, o_j) = \frac{P_{t^*}(w_1, o_j)}{\sum_{o_k \neq o_1} P_{t^*}(w_1, o_k)} \qquad (15)$$

with $j = 2, \ldots, 12$ so that $\sum_{o_j \neq o_1} P'_{t^*}(w_1, o_j) = 1$. Similarly, when
$o_7$ is left out, the renormalization becomes

$$P'_{t^*}(w_1, o_j) = \frac{P_{t^*}(w_1, o_j)}{\sum_{o_k \neq o_7} P_{t^*}(w_1, o_k)} \qquad (16)$$



with $j = 1, \ldots, 6, 8, \ldots, 12$ so that $\sum_{o_j \neq o_7} P'_{t^*}(w_1, o_j) = 1$. We
are interested in the (renormalized) confidences $P'_{t^*}(w_1, o_1)$,
$P'_{t^*}(w_1, o_7)$, $P'_{t^*}(w_7, o_7)$, and $P'_{t^*}(w_7, o_1)$, which are shown in
Figures 5, 6 for the conditions where words $w_i$, $i = 1, \ldots, 6$ are
repeated 3 (left panel), 6 (middle panel), and 9 (right panel) times
in the early learning phase, and the words $w_i$, $i = 1, \ldots, 12$ are
repeated 6 times in the late phase. The figures exhibit the
performance of the algorithm for the set of parameters $\chi$ and $\beta$ that
best fits the experimental data of Kachergis et al. (2012) for fixed
$\alpha_0$. This optimum set is shown in Figure 7 for the 6 early
repetition condition, which is practically indistinguishable from the
optima of the other two conditions. The conditions with the
different word repetitions in the early phase were intended to produce
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FIGURE 5 | Results of the experiments on mutual exclusivity in the case 
the late phase of the training process comprises 6 repetitions of each 
word. The blue symbols represent the probability that the algorithm picks
object $o_1$ as the referent of word $w_1$ whereas the red symbols represent the
probability it picks $o_7$. The broken horizontal lines and the shaded regions
around them represent the experimental means and one standard deviation,
respectively (Kachergis et al., 2012). The left panel
shows the results for 3 repetitions of $w_1$ in the early training phase, the
middle panel for 6 repetitions and the right panel for 9 repetitions. The results
correspond to the parameters $\chi$ and $\beta$ that best fit the experimental data for
fixed $\alpha_0$.




FIGURE 6 | Results of the experiments on mutual exclusivity in the case
the late phase of the training process comprises 6 repetitions of each
word. The green symbols represent the probability that the algorithm picks
object $o_7$ as the referent of word $w_7$ whereas the orange symbols represent
the probability it picks $o_1$. The broken horizontal lines and the shaded regions
around them represent the experimental means and one standard deviation,
respectively (Kachergis et al., 2012). The left panel
shows the results for 3 repetitions of $w_1$ in the early training phase, the
middle panel for 6 repetitions and the right panel for 9 repetitions. The results
correspond to the parameters $\chi$ and $\beta$ that best fit the experimental data for
fixed $\alpha_0$.



distinct confidences on the learned association $(w_1, o_1)$ before the
onset of the late phase in the training stage. The insensitivity of
the results to these conditions probably indicates that the association
was already learned well enough with only 3 repetitions. Finally,
we note that, though the testing stage focused on words $w_1$ and
$w_7$ only, all word pairs $w_i$ and $w_{i+6}$ with $i = 1, \ldots, 6$ are strictly
equivalent since they appear the same number of times during the
training stage.
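
The renormalization of Equations (15) and (16) amounts to dropping the object that is absent from the test display and rescaling the remaining confidences so that they sum to one. A minimal sketch follows; the function name `renormalize` and the confidence values are made up for illustration and are not taken from the experiments or from the fitted model.

```python
def renormalize(confidences, left_out):
    """Drop `left_out` and rescale the remaining confidences to sum to 1."""
    kept = {o: p for o, p in confidences.items() if o != left_out}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no confidence mass left on the displayed objects")
    return {o: p / total for o, p in kept.items()}

# Hypothetical confidences P_{t*}(w_1, o_j) over the 12 trained objects.
conf_w1 = {f"o{j}": 0.022 for j in range(1, 13)}
conf_w1["o1"] = 0.5
conf_w1["o7"] = 0.28

test_without_o1 = renormalize(conf_w1, "o1")  # Equation (15)
test_without_o7 = renormalize(conf_w1, "o7")  # Equation (16)
print(test_without_o1["o7"], test_without_o7["o1"])
```

Because the same unnormalized confidences feed both tests, any mass the learner places on objects other than $o_1$ and $o_7$ keeps the renormalized values below one.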

The experimental results exhibited in Figure 6 offer indirect
evidence that the participants have resorted to mutual exclusivity
to produce their word-object mappings. In fact, from the
perspective of a purely associative learner, word $w_7$ should be
associated with objects $o_1$ or $o_7$ only, but since in the testing stage
one of those objects was not displayed, such a learner would surely
select the correct referent. However, the finding that $P'_{t^*}(w_7, o_7)$
is considerably greater than $P'_{t^*}(w_7, o_1)$ (they should be equal
for an associative learner) indicates that there is a bias against
the association $(w_7, o_1)$ which is motivated, perhaps, by the
previous understanding that $o_1$ was the referent of word $w_1$.
In fact, a most remarkable result revealed by Figure 6 is that
$P'_{t^*}(w_7, o_7) < 1$. Since word $w_7$ appeared only in the late phase
context $\Omega = \{w_1, o_1, w_7, o_7\}$ and object $o_1$ was not displayed in
the testing stage, we must conclude that the participants
produced spurious associations between words and objects that
never appeared together in a context. Our algorithm accounts
for these associations through Equation (4) in the case of new
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words and, more importantly, through eqs. (9) and (13) due 
to the effect of the information efficiency factor $\alpha_t(w_i)$. The
experimental data are well described only in the narrow range
$\alpha_0 \in [0.85, 0.9]$.

Figure 8 exhibits the developmental timeline of the cross-situational
learning history of the algorithm with the optimal
set of parameters (see the figure caption) for the three different
training conditions in the early training phase. This phase is
characterized by the steady growth of the confidence on the
association $(w_1, o_1)$ (blue symbols) accompanied by the decrease of
the confidence on association $(w_1, o_7)$ (red symbols). As the word
$w_7$ does not appear in the early training phase, the confidences
on its association with any object remain constant, corresponding
to the accuracy value 1/11 (we recall that $o_1$ is left out of the




FIGURE 7 | Parameters $\chi$ (reinforcement strength) and $\beta$ (inference
strength) corresponding to the best fit shown in Figures 5 and 6 in the
case word $w_1$ is repeated 6 times in the early training phase.



display in the testing stage). The beginning of the late training
phase is marked by a steep increase of the confidence on the
association $(w_7, o_7)$ (green symbols) whereas the confidence on $(w_1, o_1)$
decreases gradually. A similarly gradual increase is observed in
the confidence on the association $(w_7, o_1)$ (orange symbols). As
expected, for large t all confidences presented in this figure tend to
the same value, since the words $w_1$ and $w_7$ always appear together
in the context $\Omega = \{w_1, o_1, w_7, o_7\}$. Finally, we note that this
developmental timeline is qualitatively similar to that produced
by the algorithm proposed by Kachergis et al. (2012).

5. DISCUSSION 

The chief purpose of this paper is to understand and model 
the mental processes used by human subjects to produce 
their word-object mappings in the controlled cross-situational 
word-learning scenarios devised by Yu and Smith (2007) and 
Kachergis et al. (2009, 2012). In other words, we seek to ana- 
lyze the psychological phenomena involved in the production 
of those mappings. Accordingly, we assume that the comple- 
tion of that task requires the existence of two cognitive abil- 
ities, namely, the associative capacity to create and reinforce 
associations between words and referents that co-occur in a 
context, and the non-associative capacity to infer word-object 
associations based on previous learning events, which accounts 
for the mutual exclusivity principle, among other things. In 
order to regulate the effectiveness of these two capacities we
introduce the parameters $\chi \in [0, 1]$, which sets the reinforcement
strength, and $\beta \in [0, 1]$, which determines the inference
strength.

In addition, since the reinforcement and inference processes
require storage, use and transmission of past and present information
(coded mainly in the values of the confidences $P_t(w_i, o_j)$)
we introduce a word-dependent quantity $\alpha_t(w_i) \in [0, 1]$ which
gauges the impact of the confidences at trial $t - 1$ on the update
of the confidences at trial $t$. In particular, the greater the certainty




FIGURE 8 | Knowledge development for the model parameters that best
fit the results of the mutual exclusivity experiments summarized in
Figures 5, 6 in the case the late phase of the training process comprises
6 repetitions of each word. The symbol colors follow the convention used in
those figures, i.e., the blue symbols represent the confidence on association
$(w_1, o_1)$, the red symbols on association $(w_1, o_7)$, the green symbols on
association $(w_7, o_7)$ and the orange symbols on association $(w_7, o_1)$. The left
panel shows the results for 3 repetitions of $w_1$ in the early training phase
($\alpha_0 = 0.85$, $\chi = 0.25$, $\beta = 0.95$), the middle panel for 6 repetitions
($\alpha_0 = 0.85$, $\chi = 0.45$, $\beta = 0.99$) and the right panel for 9 repetitions
($\alpha_0 = 0.85$, $\chi = 0.4$, $\beta = 0.95$). For each trial $t$ the symbols represent
the average over $10^5$ realizations of the learning process.






about the referent of word $w_i$, the greater the relevance of the
previous confidences. However, there is a baseline information gauge
factor $\alpha_0 \in [0, 1]$ used to process words for which the uncertainty
about their referents is maximum. The adaptive expression for
$\alpha_t(w_i)$ given in Equation (7) seems to be critical for the fitting
of the experimental data. In fact, our first choice was to use a
constant information gauge factor (i.e., $\alpha_t(w_i) = \alpha$ $\forall t, w_i$) with
which we were able to describe only the experiments summarized
in Figures 1, 4 (data not shown). Note that a consequence of
prescription (7) is that once the referent of a word is learned with
maximum confidence (i.e., $P_t(w_i, o_j) = 1$ and $P_t(w_i, o_k) = 0$ for
$o_k \neq o_j$) it is never forgotten.

The algorithm described in Section 3 comprises three free
parameters $\chi$, $\beta$ and $\alpha_0$ which are adjusted so as to fit a
representative selection of the experimental data presented in Kachergis
et al. (2009, 2012). A robust result from all experiments is that the
baseline information gauge factor is in the range $0.7 < \alpha_0 < 1$.
Actually, the fast mapping experiments narrow this interval down
to $0.85 < \alpha_0 < 0.9$. This is a welcome result because we do not
have a clear-cut interpretation for $\alpha_0$ (it encompasses storage,
processing and transmission of information), and so the fact that
this parameter does not vary much across wildly distinct experimental
settings is evidence that, whatever its meaning, it is not relevant
to explain the learning strategies used in the different experimental
conditions. Fortunately, this is not the case for the two other
parameters $\chi$ and $\beta$.

For instance, in the fast mapping experiments discussed in
Subsection 4.3 the best fit of the experimental data is achieved
for $\beta \approx 1$, thus indicating the extensive use of mutual exclusivity,
and inference in general, by the participants of those experiments.
Moreover, in that case the best fit corresponds to a low
(but non-zero) value of $\chi$, which is expected since for contexts
that exhibit only two associations (C = 2), most of the
disambiguations are likely to be achieved solely through inference. This
contrasts with the experiments on variable word sampling
frequencies discussed in Subsection 4.1, for which the best fit is
obtained with intermediate values of $\beta$ and $\chi$, so the participants'
use of reinforcement and inference was not too unbalanced. The
contextual diversity experiment of Subsection 4.2, in which the
words are segregated in two isolated groups of 12 and 6 components,
offers another extreme learning situation, since the best fit
corresponds to $\chi \approx 1$ and $\beta \approx 0$ in that case. To understand this
result, first we recall that most of the participants' errors were due
to misassignments of referents belonging to the same group as
the test word, and those confidences were strengthened mainly
by the reinforcement process. Second, in contrast to the inference
process, which creates and strengthens spurious intergroup
associations via Equation (12), the reinforcement process solely
weakens those associations via Equation (9). Thus, considering
the learning conditions of the contextual diversity experiment
it is no surprise that reinforcement was the participants' strategy
of choice.

It is interesting to note that the optimal sets of parameters that
describe the fast mapping experiments (see Figures 5-8) indicate
that there is a trade-off in the values of those parameters,
in the sense that high values of the inference parameter $\beta$ require
low values of the reinforcement parameter $\chi$. Since this is not an
artifact of the model, which poses no constraint on those values
(e.g., they are both large for small $\alpha_0$), the trade-off may reveal a
limitation on the amount of attentional resources available to the
learner to distribute among the two distinct mental processes.

Our results agree with the findings of Smith et al. (2011) that 
participants use various learning strategies, which in our case are 
determined by the values of the parameters $\chi$ and $\beta$, depending
on the specific conditions of the cross-situational word-learning 
experiment. In particular, in the case of low within-trial ambi- 
guity those authors found that participants generally resorted to 
a rigorous eliminative approach to infer the correct word-object 
mapping. This is exactly the conclusion we reached in the anal- 
ysis of the fast mapping experiment for which the within-trial 
ambiguity takes the lowest possible value (C = 2). 

Although the adaptive learning algorithm presented in
this paper reproduced the performance of adult participants
in cross-situational word-learning experiments quite successfully,
the deterministic nature of the algorithm somewhat hindered
the psychological interpretation of the information
gauge factor $\alpha_t(w_i)$. In fact, not only are learning and
behavior best described as stochastic processes (Atkinson
et al., 1965) but also the modeling of those processes requires
(and facilitates) a precise interpretation of the model parameters,
since they are introduced in the model as transition
probabilities.
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